The truth is almost never linear!
But often the linearity assumption is “good enough”
What about when it's not?
Polynomials
Step Functions
Splines
Local Regression
Generalized Additive Models
Create new variables \(X_1 = X\), \(X_2 = X^2\), and so on, then treat as multiple linear regression
Not really interested in the coefficients; more interested in the fitted function values at any value \(x_0\):
Since \(\hat{f}(x_0)\) is a linear function of the \(\hat{\beta}_\ell\), we can get a simple expression for the pointwise variance \(\mathrm{Var}[\hat{f}(x_0)]\) at any value \(x_0\). In the figure above, we have computed the fit and pointwise standard errors on a grid of values for \(x_0\). We show \(\hat{f}(x_0) \pm 2 \cdot \mathrm{se}[\hat{f}(x_0)]\)
We either fix the degree \(d\) at some reasonably low value, or else use cross-validation to choose \(d\)
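As a concrete sketch (assuming the Wage data from the ISLR package, which the figures appear to use), a degree-4 fit with its \(\pm 2\) SE bands:

```r
# Degree-4 polynomial regression with pointwise +/- 2 SE bands
# (a minimal sketch, assuming the ISLR Wage data)
library(ISLR)
fit <- lm(wage ~ poly(age, 4), data = Wage)
grid <- data.frame(age = seq(min(Wage$age), max(Wage$age), length.out = 100))
pred <- predict(fit, newdata = grid, se.fit = TRUE)
bands <- cbind(pred$fit - 2 * pred$se.fit, pred$fit + 2 * pred$se.fit)
```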
Logistic regression follows naturally. For example, in the figure we model:
\(\Pr(y_i > 250 \mid x_i) = \frac{\exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d)}{1 + \exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d)}\)
To get confidence intervals, compute upper and lower bounds on the logit scale, and then invert to get them on the probability scale
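A sketch of this logit-then-invert computation (again assuming the ISLR Wage data):

```r
# Polynomial logistic regression; bounds are built on the logit scale
# and then inverted to the probability scale (sketch, ISLR Wage data)
fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)
grid <- data.frame(age = 20:80)
pred <- predict(fit, newdata = grid, se.fit = TRUE)  # logit scale by default
upper <- pred$fit + 2 * pred$se.fit
lower <- pred$fit - 2 * pred$se.fit
prob_upper <- exp(upper) / (1 + exp(upper))          # invert to probabilities
prob_lower <- exp(lower) / (1 + exp(lower))
```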
Can do separately on several variables—just stack the variables into one matrix, and separate out the pieces afterwards (see GAMs later)
Caveat: polynomials have notorious tail behavior — very bad for extrapolation
Can fit using y ~ poly(x, degree = 3) in the formula
Another way of creating transformations of a variable — cut the variable into distinct regions
Easy to work with. Creates a series of dummy variables representing each group
Useful way of creating interactions that are easy to interpret. For example, interaction effect of Year and Age:
\(I(\text{Year} < 2005) \cdot \text{Age}, \quad I(\text{Year} \geq 2005) \cdot \text{Age}\)
Would allow for a different linear function of Age in each Year category (see the sketch below)
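A minimal sketch of step functions via cut(), which builds the dummy variables automatically (ISLR Wage data assumed):

```r
# Step function fit: cut() bins age into four regions, and lm() treats the
# resulting factor as a set of dummy variables (sketch, ISLR Wage data)
fit_step <- lm(wage ~ cut(age, 4), data = Wage)
coef(summary(fit_step))
```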
Choice of cutpoints or knots can be problematic. For creating nonlinearities, smoother alternatives such as splines are available
Instead of a single polynomial in \(X\) over its whole domain, we can rather use different polynomials in regions defined by knots
Better to add constraints to the polynomials, e.g. continuity
Splines have the “maximum” amount of continuity
Linear Splines
A linear spline with knots at \(\xi_k, k = 1,...,K\) is a piecewise linear polynomial continuous at each knot
We can represent this model as:
\(y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+1} b_{K+1}(x_i) + \epsilon_i\)
Where the \(b_k\) are basis functions:
\(b_1(x_i) = x_i\)
\(b_{k+1}(x_i) = (x_i - \xi_k)_+, \quad k = 1, \dots, K\)
Here \((\cdot)_+\) means positive part: \((x_i - \xi_k)_+ = \begin{cases} x_i - \xi_k & \text{if } x_i > \xi_k \\ 0 & \text{otherwise} \end{cases}\)
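The basis is easy to build by hand; a sketch on simulated data (knot locations are arbitrary, for illustration only):

```r
# Hand-rolled linear spline: b_1(x) = x plus one truncated term per knot
set.seed(1)
x <- sort(runif(200, 0, 100))
y <- sin(x / 15) + rnorm(200, sd = 0.3)
knots <- c(30, 60)
B <- sapply(knots, function(k) pmax(x - k, 0))  # the (x - xi_k)_+ columns
fit_ls <- lm(y ~ x + B)                         # piecewise linear, continuous at knots
```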
Cubic Splines
A cubic spline with knots at \(\xi_k, k = 1,...,K\) is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot
Again, we can represent this model with truncated power basis functions
\(y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i\)
\(b_1(x_i) = x_i\)
\(b_2(x_i) = x_i^2\)
\(b_3(x_i) = x_i^3\)
\(b_{k+3}(x_i) = (x_i - \xi_k)_+^3, k = 1,...,K\)
Where \((x_i - \xi_k)_+^3 = \begin{cases} (x_i - \xi_k)^3 & \text{if } x_i > \xi_k \\ 0 & \text{otherwise} \end{cases}\)
Natural Cubic Splines
A natural cubic spline is a cubic spline constrained to be linear beyond the boundary knots; the extra constraints free up degrees of freedom, allowing more interior knots for the same budget
Fitting splines in R is easy
bs(x, ...) for splines of any degree, and ns(x, ...) for natural cubic splines, both in the splines package
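A sketch of both (knot locations chosen for illustration; ISLR Wage data assumed):

```r
library(splines)
# Cubic spline and natural cubic spline with the same interior knots
fit_bs <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
fit_ns <- lm(wage ~ ns(age, knots = c(25, 40, 60)), data = Wage)
```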
Knot Placement
One strategy is to decide \(K\), the number of knots, and then place them at appropriate quantiles of the observed \(X\)
A cubic spline with K knots has \(K + 4\) parameters or degrees of freedom
A natural spline with \(K\) knots has \(K\) degrees of freedom
Below is a comparison of a degree-14 polynomial, poly(age, degree = 14), and a natural cubic spline, ns(age, df = 14), each with 15 degrees of freedom
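In code (a sketch, again assuming the ISLR Wage data):

```r
# The two fits being compared, each using 15 degrees of freedom
fit_poly <- lm(wage ~ poly(age, degree = 14), data = Wage)
fit_nat  <- lm(wage ~ ns(age, df = 14), data = Wage)
```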
Smoothing Splines
Consider this criterion for fitting a smooth function \(g(x)\) to some data:
\(\underset{g \in \mathcal{S}}{\text{minimize}} \ \sum_{i=1}^{n}(y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt\)
The first term is RSS and tries to make \(g(x)\) match the data at each \(x_i\)
The second term is a roughness penalty and controls how wiggly \(g(x)\) is. It is modulated by the tuning parameter \(\lambda \geq 0\)
The smaller \(\lambda\), the more wiggly the function, eventually interpolating \(y_i\) when \(\lambda = 0\)
As \(\lambda \to \infty\), the function \(g(x)\) becomes linear
The solution is a natural cubic spline, with a knot at every unique value of \(x_i\). The roughness penalty still controls the roughness via \(\lambda\)
Smoothing splines avoid the knot-selection issue, leaving a single \(\lambda\) to be chosen
The algorithmic details are beyond the scope of this course. In R, the function smooth.spline() will fit a smoothing spline
The vector of \(n\) fitted values can be written as \(\mathbf{\hat{g}}_\lambda = \mathbf{S}_\lambda\mathbf{y}\), where \(\mathbf{S}_\lambda\) is an \(n \times n\) matrix (determined by the \(x_i\) and \(\lambda\))
The effective degrees of freedom are given by: \(df_\lambda = \sum_{i=1}^{n} \left\{\mathbf{S}_\lambda \right\}_{ii}\)
Choosing \(\lambda\)
We can specify \(df\) rather than \(\lambda\)
The leave-one-out (LOO) cross-validated error is given by: \(RSS_{cv}(\lambda) = \sum_{i=1}^{n}(y_i - \hat{g}_\lambda^{-i}(x_i))^2 = \sum_{i=1}^{n}\left [ \frac{y_i- \hat{g}_\lambda(x_i)}{1-\left\{\mathbf{S}_\lambda \right\}_{ii}} \right ]^2\)
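Both routes are available in smooth.spline(): fix the effective df, or set cv = TRUE for leave-one-out cross-validation (a sketch, ISLR Wage data assumed; ties in age trigger a warning under cv = TRUE):

```r
# Smoothing spline: fix the effective df, or let LOO CV choose lambda
fit_df <- smooth.spline(Wage$age, Wage$wage, df = 16)
fit_cv <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)
fit_cv$df  # effective degrees of freedom chosen by cross-validation
```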
Local Regression
Local regression can be fit with the loess() function in R
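A minimal call (ISLR Wage data assumed; span controls the fraction of data in each local neighborhood):

```r
# Local regression: each fitted value uses the nearest 50% of the data
fit_lo <- loess(wage ~ age, span = 0.5, data = Wage)
```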
Generalized Additive Models
Allows for flexible nonlinearities in several variables, but retains the additive structure of linear models
Can fit GAM simply using natural splines:
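For example (a sketch; the df values and the Wage data are assumptions following common ISLR usage):

```r
# A GAM fit with plain lm(): natural-spline bases for each quantitative term
gam_ns <- lm(wage ~ ns(year, df = 4) + ns(age, df = 5) + education, data = Wage)
```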
Coefficients are not that interesting; the fitted functions are. The previous plot was produced using plot.gam()
Can mix terms, some linear and some nonlinear, and use anova() to compare models (see the sketch below)
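A sketch of such a comparison (assuming the gam package and the ISLR Wage data):

```r
# anova() comparison: is a nonlinear term in year needed?
library(gam)
m1 <- gam(wage ~ year + s(age, df = 5) + education, data = Wage)
m2 <- gam(wage ~ s(year, df = 4) + s(age, df = 5) + education, data = Wage)
anova(m1, m2, test = "F")
```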
Can use smoothing splines or local regression as well:
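```r
# s() gives a smoothing-spline term, lo() a local-regression term
# (sketch, gam package and ISLR Wage data assumed)
gam_mix <- gam(wage ~ s(year, df = 4) + lo(age, span = 0.7) + education,
               data = Wage)
```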
GAMs are additive, although low-order interactions can be included in a natural way using, e.g., bivariate smoothers or interactions of the form ns(age, df = 5):ns(year, df = 5), as sketched below
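Two illustrative sketches (gam package loaded as above; ISLR Wage data assumed):

```r
# Interaction of two spline bases inside lm()
fit_int <- lm(wage ~ ns(age, df = 5):ns(year, df = 5) + education, data = Wage)

# Bivariate local-regression smoother over (year, age) inside gam()
gam_biv <- gam(wage ~ lo(year, age, span = 0.5) + education, data = Wage)
```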